Abstract

Multi-task learning is a natural approach for computer vision applications that
require the simultaneous solution of several distinct but related problems, e.g.
object detection, classification, tracking of multiple agents, or denoising, to name a
few. The key idea is that exploiting task relatedness (structure) can lead to improved
performance.

In this paper, we propose and study a novel sparse, non-parametric approach
exploiting the theory of Reproducing Kernel Hilbert Spaces for vector-valued
functions. We develop a suitable regularization framework which can be formulated
as a convex optimization problem, and is provably solvable using an alternating
minimization approach. Empirical tests show that the proposed method compares
favorably to state-of-the-art techniques and further allows us to recover interpretable
structures, a problem of interest in its own right.


1 Introduction 

Several problems in computer vision and image processing, such as object
detection/classification, image denoising, inpainting etc., require solving multiple learning
tasks at the same time. In such settings a natural question is whether it could
be beneficial to solve all the tasks jointly, rather than separately. This idea is at the
basis of the field of multi-task learning, where the joint solution of different problems
has the potential to exploit tasks relatedness (structure) to improve learning. Indeed,
when knowledge about task relatedness is available, it can be profitably incorporated
in multi-task learning approaches, for example by designing suitable embedding/coding
schemes, kernels or regularizers, see [?].

The more interesting case, when the tasks structure is not known
a priori, has been the subject of recent studies. Largely influenced by the success of
sparsity based methods, a common approach has been that of considering linear models
for each task coupled with suitable parameterization/penalization enforcing task
relatedness, for example encouraging the selection of features simultaneously important for
all tasks [?] or for specific subgroups of related tasks [13, 14, 23, 15, 12, 16]. Other
linear methods adopt hierarchical priors or greedy approaches to recover the taxonomy
of tasks [22, 24]. A different line of research has been devoted to the development of
non-linear/non-parametric approaches using kernel methods, either from a Gaussian
process [?] or a regularization perspective [?].

This paper follows this last line of research, tackling in particular two issues only
partially addressed in previous works. The first is the development of a regularization
framework to learn and exploit the tasks structure, which is not only important
for prediction, but also for interpretation. Towards this end, we propose and study a
family of matrix-valued reproducing kernels, parametrized so as to enforce sparse
relations among tasks. A novel algorithm dubbed Sparse Kernel MTL is then proposed
considering a Tikhonov regularization approach. The second contribution is to provide
a sound computational framework to solve the corresponding minimization problem.
While we follow a fairly standard alternating minimization approach, unlike most
previous work we can exploit results in convex optimization to prove the convergence of
the considered procedure. The latter has an interesting interpretation where supervised
and unsupervised learning steps are alternated: first, given a structure, multiple tasks
are learned jointly, then the structure is updated. We support the proposed method with
an experimental analysis both on synthetic and real data, including classification and
detection datasets. The obtained results show that Sparse Kernel MTL can achieve
state-of-the-art performance while unveiling the structure describing tasks relatedness.

The paper is organized as follows: in Sec. 2 we provide some background and
notation in order to motivate and introduce the Sparse Kernel MTL model. In Sec. 3 we
discuss an alternating minimization algorithm to provably solve the learning problem
proposed. Finally, we discuss empirical evaluation in Sec. 4.

Notation. With $S^n_{++} \subset S^n_+ \subset S^n \subset \mathbb{R}^{n \times n}$ we denote
respectively the space of positive definite, positive semidefinite (PSD) and symmetric
$n \times n$ real-valued matrices. $O^n$ denotes the space of orthonormal $n \times n$ matrices.
For any $M \in \mathbb{R}^{n \times m}$, $M^\top$ denotes the transpose of $M$. For any PSD matrix
$A \in S^n_+$, $A^\dagger \in S^n_+$ denotes the pseudoinverse of $A$. We denote by
$I_n \in S^n_{++}$ the $n \times n$ identity matrix. We use the abbreviation
l.s.c. to denote lower semi-continuous functions (i.e. functions with closed sub-level
sets) [?].


2 Model 


We formulate the problem of solving multiple learning tasks as that of learning a
vector-valued function whose output components correspond to individual predictors. We
consider the framework originally introduced in [?], where the well-known concept of
Reproducing Kernel Hilbert Space is extended to spaces of vector-valued functions. In
this setting the set of tasks relations has a natural characterization in terms of a positive
semidefinite matrix. By imposing a sparse prior on this object we are able to formulate
our model, Sparse Kernel MTL, as a kernel learning problem designed to recover the
most relevant relations among the tasks.

In the following we review basic definitions and results from the theory of
Reproducing Kernel Hilbert Spaces that will allow us in Sec. 2.2 to motivate and introduce
our learning framework. In Sec. 2.2.2 we briefly draw connections of our method to
previously proposed multi-task learning approaches.


2.1 Reproducing Kernel Hilbert Spaces for Vector-Valued Functions


We consider the problem of learning a function $f : \mathcal{X} \to \mathcal{Y}$ from a set of empirical
observations $\{(x_i, y_i)\}_{i=1}^n$ with $x_i \in \mathcal{X}$ and $y_i \in \mathcal{Y} \subseteq \mathbb{R}^T$. This setting
includes learning problems such as vector-valued regression ($\mathcal{Y} = \mathbb{R}^T$),
multi-label/detection for $T$ tasks ($\mathcal{Y} = \{0,1\}^T$) or also $T$-class classification (where we
adopt the standard one-vs-all approach mapping the $t$-th class label to the $t$-th element
$e_t$ of the canonical basis in $\mathbb{R}^T$). Following the work of Micchelli and Pontil [?], we adopt
a Tikhonov regularization approach in the setting of Reproducing Kernel Hilbert Spaces
for vector-valued functions (RKHSvv). RKHSvv are the generalization of the well-known
RKHS to the vector-valued setting and maintain most of the properties of their scalar
counterpart. In particular, similarly to standard RKHS, RKHSvv are uniquely
characterized by an operator-valued kernel:

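Definition 2.1. Let $\mathcal{X}$ be a set and $(\mathcal{H}, \langle\cdot,\cdot\rangle_{\mathcal{H}})$ be a Hilbert space of
functions from $\mathcal{X}$ to $\mathbb{R}^T$. A symmetric, positive definite, matrix-valued function
$\Gamma : \mathcal{X} \times \mathcal{X} \to \mathbb{R}^{T \times T}$ is called a reproducing kernel for $\mathcal{H}$ if for all
$x \in \mathcal{X}$, $c \in \mathbb{R}^T$ and $f \in \mathcal{H}$ we have that $\Gamma(x, \cdot)c \in \mathcal{H}$ and the following
reproducing property holds: $\langle f(x), c\rangle_{\mathbb{R}^T} = \langle f, \Gamma(x, \cdot)c\rangle_{\mathcal{H}}$.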




Analogously to the scalar setting, a Representer theorem holds, stating that the
solution to the regularized learning problem

    $$\min_{f \in \mathcal{H}} \; \frac{1}{n}\sum_{i=1}^n V(y_i, f(x_i)) + \lambda \|f\|_{\mathcal{H}}^2 \qquad (1)$$

is of the form $f(\cdot) = \sum_{i=1}^n \Gamma(\cdot, x_i) c_i$ with $c_i \in \mathbb{R}^T$, $\Gamma$ the matrix-valued kernel
associated to the RKHSvv $\mathcal{H}$ and $V : \mathcal{Y} \times \mathbb{R}^T \to \mathbb{R}_+$ a loss function (e.g. least squares,
hinge, logistic, etc.) which we assume to be convex. We point out that the setting
above can also account for the case where not all task outputs $y_i = (y_{i1}, \dots, y_{iT})^\top$
associated to a given input $x_i$ are available in training. Such a situation would arise for
instance in multi-detection problems in which supervision (e.g. presence/absence of an
object class in the image) is provided only for a few tasks at a time.

2.1.1 Separable Kernels 

Depending on the choice of operator-valued kernel $\Gamma$, different structures can be
enforced among the tasks; this effect can be observed by restricting ourselves to the
family of separable kernels. Separable kernels are matrix-valued functions of the form
$\Gamma(x, x') = k(x, x')A$, where $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ is a scalar reproducing kernel and
$A \in S^T_+$ a $T \times T$ positive semidefinite (PSD) matrix. Intuitively, the scalar kernel
characterizes the individual tasks functions, while the matrix $A$ describes how they
are related. Indeed, from the Representer theorem we have that solutions of
problem (1) are of the form $f(\cdot) = \sum_{i=1}^n k(\cdot, x_i) A c_i$ with the $t$-th task being
$f_t(\cdot) = \sum_{i=1}^n k(\cdot, x_i) \langle A_t, c_i\rangle_{\mathbb{R}^T}$, a scalar function in the RKHS $\mathcal{H}_k$ associated to
kernel $k$ (here $A_t$ denotes the $t$-th column of $A$). As shown in [?], in this case the
squared norm associated to the separable kernel $kA$ in the RKHSvv $\mathcal{H}$ can be written as

    $$\|f\|_{\mathcal{H}}^2 = \sum_{t,s=1}^T A^\dagger_{ts} \langle f_t, f_s\rangle_{\mathcal{H}_k}, \qquad (2)$$

with $A^\dagger_{ts}$ the $(t, s)$-th entry of $A$'s pseudo-inverse.

Eq. (2) shows how $A$ can model the structural relations among tasks by directly
coupling predictors: for instance, by setting $A^\dagger = I_T + \gamma (\mathbf{1}\mathbf{1}^\top)/T$, with
$\mathbf{1} \in \mathbb{R}^T$ the vector of all 1s, we have that the parameter $\gamma$ controls the variance
$\sum_{t=1}^T \|\bar{f} - f_t\|_{\mathcal{H}_k}^2$ of the tasks with respect to their mean
$\bar{f} = \frac{1}{T}\sum_{t=1}^T f_t$. If we have access to some
notion of similarity among tasks in the form of a graph with adjacency matrix
$W \in S^T$, we can consider the regularizer
$\sum_{t,s=1}^T W_{ts} \|f_t - f_s\|_{\mathcal{H}_k}^2 + \gamma \sum_{t=1}^T \|f_t\|_{\mathcal{H}_k}^2$, which
corresponds to setting $A^\dagger = L + \gamma I_T$ with $L$ the graph Laplacian induced by $W$. We
refer the reader to [?] for more examples of possible choices for $A$ when the tasks
structure is known.
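The norm identity in Eq. (2) can be checked numerically in the finite-sample case. The following is a minimal sketch, not taken from the paper: it assumes a Gaussian scalar kernel, random coefficients and a randomly generated positive definite structure matrix, and the helper names (`gaussian_kernel`, `norm_from_kernel`, `norm_from_tasks`) are hypothetical. It compares the norm computed directly from the separable kernel, $\sum_{i,j} k(x_i, x_j)\, c_i^\top A c_j$, with the task-wise expression on the right-hand side of Eq. (2).

```python
import numpy as np

def gaussian_kernel(X, sigma=1.0):
    # Scalar kernel matrix K_ij = exp(-||x_i - x_j||^2 / (2 sigma^2)).
    sq = np.sum(X**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-np.maximum(d2, 0.0) / (2.0 * sigma**2))

def norm_from_kernel(K, C, A):
    # ||f||^2 = sum_ij k(x_i, x_j) c_i^T A c_j = tr(C^T K C A); row i of C is c_i.
    return np.trace(C.T @ K @ C @ A)

def norm_from_tasks(K, C, A):
    # Task t has kernel-expansion coefficients B[:, t] with B = C A,
    # hence <f_t, f_s> = (B^T K B)_ts in the scalar RKHS.
    B = C @ A
    G = B.T @ K @ B
    # Right-hand side of Eq. (2): sum_ts pinv(A)_ts <f_t, f_s>.
    return np.sum(np.linalg.pinv(A) * G)

rng = np.random.default_rng(0)
n, d, T = 8, 3, 4
X = rng.standard_normal((n, d))
C = rng.standard_normal((n, T))
R = rng.standard_normal((T, T))
A = R @ R.T + 0.1 * np.eye(T)  # assumed positive definite structure matrix
K = gaussian_kernel(X)

lhs = norm_from_kernel(K, C, A)
rhs = norm_from_tasks(K, C, A)
print(lhs, rhs)
```

The two printed values should agree up to floating-point error, which illustrates how the structure matrix enters the regularizer only through the pairwise inner products of the task predictors.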


2.2 Sparse Kernel Multi Task Learning 


When a-priori knowledge of the problem structure is not available, it is desirable to
learn the tasks relations directly from the data. In light of the observations of Sec. 2.1.1,
a viable approach is to parametrize the RKHSvv $\mathcal{H}$ in problem (1) with the associated
separable kernel $kA$ and to optimize jointly with respect to both $f \in \mathcal{H}$ and $A \in S^T_+$.
In the following we show how this problem corresponds to that of identifying a set
of latent tasks and combining them in order to form the individual predictors. By
enforcing a sparsity prior on the set of such possible combinations, we then propose the
Sparse Kernel MTL model, which is designed to recover only the most relevant tasks
relations. In Sec. 2.2.2 we discuss, from a modeling perspective, how our framework
is related to the current multi-task learning literature.


2.2.1 Recovering the Most Relevant Relations 

From the Representer theorem introduced in Sec. 2.1 we know that a candidate solution
$f : \mathcal{X} \to \mathbb{R}^T$ to problem (1) can be parametrized in terms of the maps $k(\cdot, x_i)$, by
a structure matrix $A \in S^T_+$ and a set of coefficient vectors $c_1, \dots, c_n \in \mathbb{R}^T$ such
that $f(\cdot) = \sum_{i=1}^n k(\cdot, x_i) A c_i$. If now we consider the $t$-th component of $f$ (i.e. the
predictor of the $t$-th task), we have that

    $$f_t(\cdot) = \sum_{i=1}^n \sum_{s=1}^T A_{ts}\, k(\cdot, x_i)\, c_{is} = \sum_{s=1}^T A_{ts}\, g_s(\cdot) \qquad (3)$$

where we set $g_s(\cdot) = \sum_{i=1}^n k(\cdot, x_i) c_{is} \in \mathcal{H}_k$ for $s \in \{1, \dots, T\}$ and $c_{is} \in \mathbb{R}$ the $s$-th
component of $c_i$. Eq. (3) provides further understanding on how $A$ can enforce/describe
the tasks relations: the $g_s$ can be interpreted as elements in a dictionary and each $f_t$
factorizes as their linear combination. Therefore, any two predictors $f_t$ and $f_{t'}$ are
implicitly coupled by the subset of common $g_s$.
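The factorization in Eq. (3) can be illustrated on the training points. The sketch below is an illustration under stated assumptions rather than the authors' code: a random PSD matrix stands in for the kernel matrix, the coefficients are random, and the 3-task structure matrix is a made-up example in which tasks 0 and 1 share latent tasks while task 2 is independent.

```python
import numpy as np

rng = np.random.default_rng(1)
n, T = 6, 3
R = rng.standard_normal((n, n))
K = R @ R.T                      # PSD matrix standing in for k(x_i, x_j)
C = rng.standard_normal((n, T))  # row i is the coefficient vector c_i

# Hypothetical sparse structure matrix: tasks 0 and 1 are coupled,
# task 2 is not related to the others.
A = np.array([[1.0, 0.5, 0.0],
              [0.5, 1.0, 0.0],
              [0.0, 0.0, 1.0]])

G = K @ C   # column s: latent task g_s evaluated at the training points
F = G @ A   # column t: f_t = sum_s A_ts g_s (A is symmetric)

# Same values from the representer form f(x_j) = sum_i k(x_j, x_i) A c_i.
F_direct = np.stack(
    [sum(K[j, i] * (A @ C[i]) for i in range(n)) for j in range(n)]
)

print(np.abs(F - F_direct).max())
```

Because the third row and column of `A` only have a diagonal entry, the predictor of task 2 coincides with its own latent task `G[:, 2]`, whereas tasks 0 and 1 each mix `G[:, 0]` and `G[:, 1]`.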

We consider the setting where the tasks structure is unknown and we aim to recover
it from the available data in the form of a structure matrix $A$. Following a
denoising/feature selection argument, our approach consists in imposing a sparsity penalty
on the set of possible tasks structures, requiring each predictor $f_t$ to be described by
a small subset of the $g_s$. Indeed, by requiring most of $A$'s entries to be equal to zero, we
implicitly enforce the system to recover only the most relevant tasks relations. The
benefits of this approach are two-fold: on the one hand it is less sensitive to spurious,
statistically non-significant tasks correlations that could for instance arise when few
training examples are available. On the other hand it provides us with interpretable
tasks structures, which is a problem of interest in its own right and relevant, for
example, in cognitive science [?].

Following the de-facto standard choice of $\ell_1$-norm regularization to impose sparsity
in convex settings, the Sparse Kernel MTL problem can be formulated as

    $$\min_{f \in \mathcal{H},\, A \in S^T_{++}} \; \frac{1}{n}\sum_{i=1}^n V(y_i, f(x_i)) + \lambda\left(\|f\|_{\mathcal{H}}^2 + \epsilon\,\mathrm{tr}(A^{-1}) + \mu\,\mathrm{tr}(A) + (1 - \mu)\|A\|_{\ell_1}\right) \qquad (4)$$

where $\|A\|_{\ell_1} = \sum_{t,s} |A_{ts}|$, $V : \mathcal{Y} \times \mathbb{R}^T \to \mathbb{R}_+$ is a loss function and $\lambda > 0$, $\epsilon > 0$,
and $\mu \in [0, 1]$ are regularization parameters. Here $\mu \in [0, 1]$ regulates the amount of
desired entry-wise sparsity of $A$ with respect to the low-rank prior $\mathrm{tr}(A)$ (indeed notice
that for $\mu = 1$ we recover the low-rank inducing framework of [?]). This prior
was empirically observed (see [?]) to indeed encourage information transfer across
tasks; the sparsity term can therefore be interpreted as enforcing such transfer to occur
only between tasks that are strongly correlated. Finally the term $\epsilon\,\mathrm{tr}(A^{-1})$ ensures
the existence of a unique solution (making the problem strictly convex), and can be
interpreted as a preconditioning of the problem (see Sec. 3.2).

Notice that the term $\|f\|_{\mathcal{H}}^2$ depends on both $f$ and $A$ (see Eq. (2)), thus making
problem (4) non-separable in the two variables. However, it can be shown that the
objective functional is jointly convex in $f$ and $A$ (we refer the reader to the Appendix
for a proof of convexity, which extends results in [?] to our setting). This will allow us
in Sec. 3 to derive an optimization strategy that is guaranteed to converge to a global
solution.

2.2.2 Previous Work on Learning the Relations among Tasks 

Several methods designed to recover the tasks relations from the data can be
formulated using our notation as joint learning problems in $f$ and $A$. Depending on the
expected/desired tasks structure, a set of constraints $\mathcal{A} \subseteq S^T_+$ can be imposed on $A$
when solving a joint problem as in (4):

• Multi-task Relation Learning [28]. In [28], the relaxation $\mathcal{A} = \{A \mid \mathrm{tr}(A) \le 1\}$
of the low-rank constraint is imposed, enforcing the tasks $f_t$ to span a
low-dimensional subspace in $\mathcal{H}_k$. This method can be shown to be approximately
equivalent to [?].

• Output Kernel Learning [?]. Rather than imposing a hard constraint, the
authors penalize the structure matrix $A$ with the squared Frobenius norm $\|A\|_F^2$.

• Cluster Multi-task Learning [?]. Assuming tasks to be organized into distinct
clusters, in [?] a learning scheme to recover such structure is proposed, which
consists of imposing a suitable set of spectral constraints $\mathcal{A}$ on $A$. We refer the
reader to the supplementary material for further details.

• Learning Graph Relations [?]. Following the interpretation in [10] reviewed
in Sec. 2.1.1 of imposing similarity relations among tasks in the form of a graph,
in [?] the authors propose a setting where a (relaxed) graph Laplacian constraint
is imposed on $A$.


3 Optimization 

Due to the clear block variable structure of Eq. (4) with respect to $f$ and $A$, we propose
an alternating minimization approach (see Alg. 1) to iteratively solve the Sparse Kernel
MTL problem by keeping one variable fixed at a time. This choice is motivated by
the fact that for a fixed $A$, problem (4) reduces to the standard multi-task learning
problem (1), for which several well-established optimization strategies have already been
considered [?, 20, 10, 21]. The alternating minimization procedure can be interpreted
as iterating between steps of supervised learning (finding the $f$ that best fits the
input-output training observations) and unsupervised learning (finding the best $A$ describing
the tasks structure, which does not involve the output data).


3.1 Solving w.r.t. f (Supervised Step)


Let $A \in S^T_{++}$ be a fixed structure matrix. From the Representer theorem (see Sec. 2.1)
we know that the solution of problem (1) is of the form $f(\cdot) = \sum_{i=1}^n k(\cdot, x_i) A c_i$ with
$c_i \in \mathbb{R}^T$. Depending on the specific loss $V$, different methods can be employed to find
such coefficients $c_i$. In particular, for the least-squares loss a closed form solution can
be derived by taking the coefficient vector $c = (c_1^\top, \dots, c_n^\top)^\top \in \mathbb{R}^{nT}$ to be [?]:

    $$c = (A \otimes K + \lambda I_{nT})^{-1} y \qquad (5)$$

where $K \in S^n_+$ is the empirical kernel matrix associated to the scalar kernel $k$,
$y \in \mathbb{R}^{nT}$ is the vector concatenating the training outputs $y_1, \dots, y_n \in \mathbb{R}^T$ and $\otimes$ denotes
the Kronecker product. A faster and more compact solution was proposed in [?] by
adopting Sylvester's method.
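To make the closed form in Eq. (5) concrete, the sketch below solves the same linear system in two ways. It is an illustration under assumptions, not the cited implementation: coefficients and outputs are stacked task by task (column-major vectorization) so that the Kronecker factor order $A \otimes K$ applies, the data are random, and the second routine uses joint eigendecompositions of $K$ and $A$ as one standard way to solve the equivalent matrix equation $KCA + \lambda C = Y$. The function names are hypothetical.

```python
import numpy as np

def solve_kron(K, A, Y, lam):
    # Direct solve of (A kron K + lam I) c = y with y = vec(Y), columns of Y
    # stacked task by task (assumed ordering). Cost is O((nT)^3).
    n, T = Y.shape
    lhs = np.kron(A, K) + lam * np.eye(n * T)
    c = np.linalg.solve(lhs, Y.flatten(order="F"))
    return c.reshape((n, T), order="F")

def solve_eig(K, A, Y, lam):
    # Equivalent matrix equation K C A + lam C = Y, diagonalized by the
    # eigenvectors of K and A. Cost is O(n^3 + T^3).
    s, U = np.linalg.eigh(K)
    d, V = np.linalg.eigh(A)
    Y_hat = U.T @ Y @ V
    C_hat = Y_hat / (np.outer(s, d) + lam)
    return U @ C_hat @ V.T

rng = np.random.default_rng(0)
n, T, lam = 7, 3, 0.5
R = rng.standard_normal((n, n))
K = R @ R.T                      # PSD kernel matrix (random stand-in)
S = rng.standard_normal((T, T))
A = S @ S.T + 0.1 * np.eye(T)    # positive definite structure matrix
Y = rng.standard_normal((n, T))  # row i is the output vector y_i

C_kron = solve_kron(K, A, Y, lam)
C_eig = solve_eig(K, A, Y, lam)
print(np.abs(C_kron - C_eig).max())
```

The eigendecomposition route avoids forming the $nT \times nT$ Kronecker matrix, which is the practical motivation for matrix-equation solvers in this step.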


3.2 Solving w.r.t. the Tasks Structure (Unsupervised Step)

Let $f$ be known in terms of its coefficients $c_1, \dots, c_n \in \mathbb{R}^T$. Our goal is to find the
structure matrix $A \in S^T_{++}$ that minimizes problem (4). Notice that each task $f_t$ can be
written as $f_t(\cdot) = \sum_{i=1}^n b_{i,t}\, k(\cdot, x_i)$ with $b_{i,t} = \langle A_t, c_i\rangle_{\mathbb{R}^T}$.
Algorithm 1 Alternating Minimization

Input: $K$ empirical kernel matrix, $y$ training outputs, $\delta$ tolerance, $V$ loss, $\lambda, \mu, \epsilon$
hyperparameters, $S$ objective functional of problem (4)
Initialize: $f_0 = 0$, $A_0 = I_T$ and $i = 0$

repeat
    $f_{i+1} \leftarrow$ SupervisedStep($V, K, y, A_i, \lambda$)
    $A_{i+1} \leftarrow$ SparseKernelMTL($K$, …)
    $i \leftarrow i + 1$
until $|S(f_{i+1}, A_{i+1}) - S(f_i, A_i)| < \delta$


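Therefore, from Eq. (2) we have

    $$\|f\|_{\mathcal{H}}^2 = \sum_{t,s=1}^T A^{-1}_{ts} \langle f_t, f_s\rangle_{\mathcal{H}_k} = \sum_{t,s=1}^T A^{-1}_{ts} \sum_{i,j=1}^n b_{it}\, b_{js}\, k(x_i, x_j) \qquad (6)$$

where we have used the reproducing property of $\mathcal{H}_k$ for the last equality. Eq. (6)
allows us to write the norm induced by the separable kernel $kA$ in the more compact
matrix notation $\|f\|_{\mathcal{H}}^2 = \mathrm{tr}(B^\top K B A^{-1})$, where $B \in \mathbb{R}^{n \times T}$ is the matrix with
$(i, t)$-th element $B_{it} = b_{it}$.

Under this new notation, problem (4) with fixed $f$ becomes

    $$\min_{A \in S^T_{++}} \; \mathrm{tr}\big(A^{-1}(B^\top K B + \epsilon I_T)\big) + \mu\,\mathrm{tr}(A) + (1 - \mu)\|A\|_{\ell_1}, \qquad (7)$$

from which we can clearly see the effect of $\epsilon$ as a preconditioning term for the tasks
covariance matrix $B^\top K B$.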
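As a sanity check on the structure of problem (7), the following sketch evaluates its objective and examines the special case $\mu = 1$, where the $\ell_1$ term vanishes. In that case the minimizer of $\mathrm{tr}(A^{-1}M) + \mathrm{tr}(A)$ over positive definite matrices is the matrix square root $M^{1/2}$ with $M = B^\top K B + \epsilon I_T$; this is a standard fact about trace-type penalties and is not stated in the text above. The data are random and the function names are hypothetical.

```python
import numpy as np

def structure_objective(A, K, B, eps, mu):
    # Objective of problem (7) for a positive definite A.
    M = B.T @ K @ B + eps * np.eye(A.shape[0])
    return (np.trace(np.linalg.solve(A, M))  # tr(A^{-1} M)
            + mu * np.trace(A)
            + (1.0 - mu) * np.abs(A).sum())

def psd_sqrt(M):
    # Symmetric square root of a PSD matrix via eigendecomposition.
    w, V = np.linalg.eigh(M)
    return (V * np.sqrt(np.maximum(w, 0.0))) @ V.T

rng = np.random.default_rng(0)
n, T, eps = 10, 4, 1e-2
R = rng.standard_normal((n, n))
K = R @ R.T
B = rng.standard_normal((n, T))

# Closed-form minimizer for mu = 1 (no l1 term).
A_star = psd_sqrt(B.T @ K @ B + eps * np.eye(T))
best = structure_objective(A_star, K, B, eps, mu=1.0)

# Objective values at random positive definite perturbations of A_star.
others = []
for _ in range(20):
    Q = rng.standard_normal((T, T))
    others.append(structure_objective(A_star + 0.1 * Q @ Q.T, K, B, eps, mu=1.0))

print(best, min(others))
```

For $\mu < 1$ no such closed form is available because of the non-smooth $\ell_1$ term, which is what motivates an iterative proximal scheme for this step.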

By employing recent results from the non-smooth convex optimization literature, 
in the following we will describe an algorithm to optimize the Sparse Kernel MTL 
problem. 

3.2.1 Primal-dual Splitting Algorithm 

First order proximal splitting algorithms have been successfully applied to solve
convex composite optimization problems that can be written as the sum of a smooth
component with nonsmooth ones [?]. They proceed by splitting, i.e. by activating each
term appearing in the sum individually. The iteration usually consists of a gradient
descent-like step determined by the smooth component, and various proximal steps
induced by the nonsmooth terms [?]. In the following we will describe one such
method, derived in [26, 17], to solve the Sparse Kernel MTL problem in Eq. (7). The
proposed method is primal-dual, in the sense that it also provides an additional dual
sequence solving the associated dual optimization problem. We will rely on the sum
structure of the objective function, that can be written as $G(\cdot) + H_1(\cdot) + H_2(L(\cdot))$, with
$G(A) = \lambda\mu\,\mathrm{tr}(A)$, $H_1(A) = \lambda(1 - \mu)\|A\|_{\ell_1}$ and
$H_2(A) = \lambda\epsilon\,\mathrm{tr}(A^{-1}) + \iota_{S^T_{++}}(A)$,
where $\iota_{S^T_{++}}$ is the indicator function of $S^T_{++}$ (0 on the set, $+\infty$ outside) and enforces
the hard constraint $A \in S^T_{++}$. $L$ is a linear operator defined as $L(A) = MAM$, where
we have set $M = (B^\top K B + \epsilon I_T)^{-1/2}$. We recall here that a square root of a PSD
matrix $P \in S^T_+$ is a PSD matrix $M \in S^T_+$ such that $P = MM$. Note that $G$ is smooth
with Lipschitz continuous gradient, $L$ is a linear operator and both $H_1$ and $H_2$ are
functions for which the proximal operator can be computed in closed form. We recall
that the proximity operator at a point $y \in \mathbb{R}^m$ of a proper, convex and l.s.c. function
$H : \mathbb{R}^m \to \mathbb{R} \cup \{+\infty\}$ is defined as

    $$\mathrm{prox}_H(y) = \operatorname*{argmin}_{x \in \mathbb{R}^m} \left\{ H(x) + \frac{1}{2}\|x - y\|^2 \right\}. \qquad (8)$$


It is well known that for any $\eta > 0$, the proximal map of the $\ell_1$ norm $\eta\|\cdot\|_{\ell_1}$ is the
so-called soft-thresholding operator, which can be computed in closed form. The
following result provides an explicit closed-form solution also for the proximal map of
$H_2$.

Proposition 3.1. Let $Z \in S^T$ with eigendecomposition $Z = U \Sigma U^\top$ with $U \in O^T$
orthonormal matrix and $\Sigma \in S^T$ diagonal. Then

    $$\mathrm{prox}_{H_2}(Z) = \operatorname*{argmin}_{A \in S^T_{++}} \left\{ \mathrm{tr}(A^{-1}) + \frac{1}{2}\|A - Z\|_F^2 \right\} \qquad (9)$$

can be computed in closed form as $\mathrm{prox}_{H_2}(Z) = U \Lambda U^\top$ with $\Lambda \in S^T_{++}$ diagonal
matrix with $\Lambda_{tt}$ the only positive root of the polynomial $p(\lambda) = \lambda^3 - \lambda^2 \Sigma_{tt} - 1$ with
$\lambda \in \mathbb{R}$.

Proof. Note that $H_2$ is convex and l.s.c. Therefore the proximity operator is well-defined
and the functional in (9) has a unique minimizer. Its gradient is $-A^{-2} + A - Z$,
therefore, the first order condition for a matrix $A$ to be a minimizer is

    $$A^3 - A^2 Z - I_T = 0. \qquad (10)$$

We show that it is possible to find $\Lambda \in S^T_{++}$ diagonal such that $A^* = U \Lambda U^\top$ solves
Eq. (10). Indeed, for $A$ with the same set of eigenvectors $U$ as $Z$, we have that Eq. (10)
becomes $U(\Lambda^3 - \Lambda^2 \Sigma - I_T)U^\top = 0$, which is equivalent to the set of $T$ scalar equations
$\lambda^3 - \lambda^2 \Sigma_{tt} - 1 = 0$ for $t \in \{1, \dots, T\}$ and $\lambda \in \mathbb{R}$. Descartes' rule of signs [?] assures
that for any $\Sigma_{tt} \in \mathbb{R}$ each of these polynomials has exactly one positive root, which
can be computed in closed form. □
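The two proximal maps used in this section can be sketched in a few lines. This is an illustrative sketch rather than the authors' implementation: the function names are hypothetical, the cubic of Proposition 3.1 is solved numerically with `numpy.roots` instead of a closed-form expression, and the test matrix is random. The result is checked against the first order condition $-A^{-2} + A - Z = 0$ from the proof.

```python
import numpy as np

def soft_threshold(Z, eta):
    # Proximal map of eta * ||.||_l1: entry-wise shrinkage towards zero.
    return np.sign(Z) * np.maximum(np.abs(Z) - eta, 0.0)

def prox_trace_inverse(Z):
    # Proximal map of A -> tr(A^{-1}) over positive definite matrices,
    # following Proposition 3.1: keep the eigenvectors of Z and replace each
    # eigenvalue s by the unique positive root of x^3 - s x^2 - 1.
    s, U = np.linalg.eigh(Z)
    lam = np.empty_like(s)
    for t, s_t in enumerate(s):
        roots = np.roots([1.0, -s_t, 0.0, -1.0])
        # The positive root exceeds s_t, while the other two roots sum to a
        # negative number and have positive product, so they have negative
        # real part: the positive root is the one with largest real part.
        lam[t] = roots[np.argmax(roots.real)].real
    return (U * lam) @ U.T

rng = np.random.default_rng(0)
R = rng.standard_normal((5, 5))
Z = (R + R.T) / 2.0              # random symmetric input (may be indefinite)
A = prox_trace_inverse(Z)
A_inv = np.linalg.inv(A)
residual = -A_inv @ A_inv + A - Z  # gradient of the functional in (9) at A

print(np.abs(residual).max())
```

Note that the output is positive definite even when `Z` has negative eigenvalues, which is how this proximal step enforces the constraint $A \in S^T_{++}$.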

We have the following result as an immediate consequence. 

Theorem 3.2 (Convergence of Sparse Kernel MTL, [?]). Let $k$ be a scalar kernel
over a space $\mathcal{X}$, $x_1, \dots, x_n \in \mathcal{X}$ a set of points and $f : \mathcal{X} \to \mathbb{R}^T$ a function
characterized by a set of coefficients $b_1, \dots, b_n \in \mathbb{R}^T$ so that $f(\cdot) = \sum_{i=1}^n k(\cdot, x_i) b_i$.
Set $K \in S^n_+$ to be the empirical kernel matrix associated to $k$ and the points $\{x_i\}_{i=1}^n$
and $B \in \mathbb{R}^{n \times T}$ the matrix whose $i$-th row corresponds to the (transposed) coefficient
vector $b_i$.

Then, any sequence of matrices $A_t$ produced by Algorithm 2 converges to a global
minimizer of the Sparse Kernel MTL problem (4) (or, equivalently, to (7)) for fixed $f$.
Furthermore, the sequence $D_t$ converges to a solution of the dual problem of (7).
Algorithm 2 Sparse Kernel MTL

Input: K G 5'^, B G 5 tolerance, 0</i<l,e>0 hyperparameter. 

Initialize: Aq.Dq g 5'++, M = cr = ||M|p squared maxi¬ 

mum eigenvalue of M. i = 0 

repeat 

Ai+i <r- proxi^||.||^^ (Ai - + MDiM)) 

P ^ A + lM{2Ai+^ - Ai)M 
A+i ^ P- prox^jy^ (crP) 

i ^ i 1 

until ||Ai+i - Ai\\F < s and ||A+i - A||f < S 


3.3 Convergence of Alternating Minimization 

We additionally exploit the sum structure and the regularity properties of the objective
functional in (4) to prove convergence of the alternating minimization scheme to a
global minimum. We rely on the results in [25]. In particular, the following result is a
direct application of Theorem 4.1 in that paper.

Theorem 3.3. Under the same assumptions as in Theorem 3.2, the sequence $(f_i, A_i)_{i \in \mathbb{N}}$
generated by Algorithm 1 is a minimizing sequence for Problem (4) and converges to its
unique solution.

Proof. Let $S$ denote the objective function in (4). First note that the level sets of $S$ are
compact due to the presence of the term $\epsilon\,\mathrm{tr}(A^{-1}) + \mu\,\mathrm{tr}(A)$ and that $S$ is continuous
on each level set. Moreover, since $S$ is regular at each point in the interior of the domain
and is convex, [25, Theorem 4.1(c)] implies that each cluster point of $(f_i, A_i)_{i \in \mathbb{N}}$ is the
unique minimizer of $S$. Then, the sequence itself is convergent and is minimizing by
continuity. □

3.3.1 A Note on Computational Complexity & Times 

Regarding the computational costs/number of iterations required for the convergence
of the whole Alg. 1, to our knowledge the only results available on rates for
Alternating Minimization are in [?]. Unfortunately these results hold only for smooth
settings. Notice however that each iteration of Alg. 2 is of the order of $O(T^3)$ (the
eigendecomposition of $A$ being the most expensive operation) and its convergence rate
is $O(1/k)$ with $k$ equal to the number of iterations. Hence, Alg. 2 is not affected by
the number $n$ of training samples. On the contrary, the supervised step in Alg. 1 (e.g.
RLS or SVM) typically requires the inversion of the kernel matrix $K$ (or some
approximation of its inverse) whose complexity heavily depends on $n$ (order of $O(n^3)$ for
inversion). Furthermore, the product $B^\top K B$ costs $O(n^2 T)$ which, since $n \gg T$, is
more expensive than Alg. 2. Thus, with respect to $n$, SKMTL scales exactly as methods
such as [2,7,24].
Figure 1: Generalization performance (nMSE and standard deviation) of different
multi-task methods with respect to the sparsity of the task structure matrix.


4 Empirical Analysis 

We report the empirical evaluation of SKMTL on both artificially generated and real
datasets, conducted to assess the capabilities of the proposed Sparse Kernel MTL method
to recover the most relevant relations among tasks and exploit such knowledge to
improve the prediction performance.

4.1 Synthetic Data 

We considered an artificial setting that allows us to control the tasks structure and
in particular the actual sparsity of the tasks-relation matrix. We generated synthetic
datasets of input-output pairs $(x, y) \in \mathbb{R}^d \times \mathbb{R}^T$ according to linear models of the form
$y^\top = x^\top U A + \epsilon$, where $U \in \mathbb{R}^{d \times T}$ is a matrix with orthonormal columns,
$A \in \mathbb{R}^{T \times T}$ is the task structure matrix and $\epsilon$ is zero-mean Gaussian noise with
variance 0.1. The inputs $x \in \mathbb{R}^d$ were sampled according to a Gaussian distribution with
zero mean and identity covariance matrix. We set the input space dimension $d = 100$ for our
experiments.

In order to quantitatively control the sparsity level of the tasks-relation matrix, we 
randomly generated A so that the ratio between its support (i.e. the number of non-zero 
entries) and the total number of entries would vary between 0.1 (90% sparsity) and 1 
(no sparsity). A Gaussian noise with zero mean and variance 1/10 of the mean value 
of the non-zero entries in A was sampled to corrupt the structure matrix entries (hence, 
the model A was never “really” sparse). This was done to reproduce a more realistic 
scenario. 
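The data generation described above can be sketched as follows. This is a sketch under stated assumptions, not the authors' script: the text does not specify how the non-zero entries of the structure matrix are drawn, so the sampling scheme in `make_structure` (uniform values in (0.1, 1) at random positions) is a guess made for illustration, and both function names are hypothetical.

```python
import numpy as np

def make_structure(T, support_ratio, rng):
    # Assumed scheme: choose round(support_ratio * T^2) entries uniformly at
    # random and fill them with values drawn uniformly from (0.1, 1).
    A = np.zeros(T * T)
    k = int(round(support_ratio * T * T))
    idx = rng.choice(T * T, size=k, replace=False)
    A[idx] = rng.uniform(0.1, 1.0, size=k)
    return A.reshape(T, T)

def make_dataset(n, d, T, support_ratio, rng, noise_var=0.1):
    A = make_structure(T, support_ratio, rng)
    # Corrupt A with zero-mean Gaussian noise whose variance is 1/10 of the
    # mean value of its non-zero entries, as described in the text.
    var_A = A[A != 0].mean() / 10.0
    A_noisy = A + rng.normal(0.0, np.sqrt(var_A), size=A.shape)
    # U has orthonormal columns (reduced QR of a Gaussian matrix).
    U, _ = np.linalg.qr(rng.standard_normal((d, T)))
    X = rng.standard_normal((n, d))  # zero mean, identity covariance
    Y = X @ U @ A_noisy + rng.normal(0.0, np.sqrt(noise_var), size=(n, T))
    return X, Y, U, A, A_noisy

rng = np.random.default_rng(0)
X, Y, U, A, A_noisy = make_dataset(n=50, d=100, T=10, support_ratio=0.3, rng=rng)
print(X.shape, Y.shape, np.count_nonzero(A) / A.size)
```

Here `A` is the uncorrupted structure matrix and `A_noisy` the one actually used to generate the outputs, mirroring the distinction between the true sparse structure and its noise-corrupted version.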
Figure 2: Structure matrix A. True (Left) and recovered by Sparse Kernel MTL (Right).
We report the absolute value of the entries of the two matrices. The range of values
goes from 0 (Blue) to 1 (Red).


We generated multiple models and corresponding datasets for different sparsity
ratios and number of tasks $T$ ranging from 5 to 20. For each dataset we generated
50 samples for training and 100 for test. We performed multi-task
regression using the following methods: single task learning (STL) as baseline, Multi-task
Relation Learning [28] (MTRL), Output Kernel Learning [?] (OKL), our Sparse
Kernel MTL (SKMTL) and a fixed task-structure multi-task regression algorithm solving
problem (1) using the ground truth (GT) matrix $A$ (after noise corruption) for
regularization. We chose the least-squares loss and performed model selection with five-fold
cross validation.

In Figure 1 we report the normalized mean squared error (nMSE) of the tested methods
with respect to decreasing sparsity ratios. It can be noticed that knowledge of the true $A$
(GT) is particularly beneficial when the tasks share few relations. This advantage tends
to decrease as the tasks structure becomes less sparse. Interestingly, both the MTRL
and OKL methods do not provide any advantage with respect to the STL baseline since
we did not design $A$ to be low-rank (or to have a fast eigenvalue decay). On the contrary,
the SKMTL method provides a remarkable improvement over the STL baseline.

We point out that the large error bars in the plot are due to the high variability of the 
nMSE with respect to the different (random) linear models A and number of tasks T. 
The actual improvement of the SKMTL over the other methods is however significant. 

The results above suggest that, as desired, our SKMTL method is actually
recovering the most relevant relations among tasks. In support of this statement we report
in Figure 2 an example of the true (uncorrupted) and recovered structure matrix $A$ in
the case of $T = 10$ and 50% sparsity. As can be noticed, while the actual values in
the entries of the two matrices are not exactly the same, their supports almost coincide,
showing that SKMTL was able to recover the correct tasks structure.
               Accuracy (%) per # tr. samples per class

               50               100              150
STL            72.23 ± 0.04     76.61 ± 0.02     79.23 ± 0.01
MTFL [?]       73.23 ± 0.08     77.24 ± 0.05     80.11 ± 0.03
MTRL [28]      73.13 ± 0.08     77.53 ± 0.04     80.21 ± 0.05
OKL [?]        72.25 ± 0.03     77.06 ± 0.01     80.03 ± 0.01
SKMTL          73.50 ± 0.11     78.23 ± 0.06     81.32 ± 0.08

Table 1: Classification results on the 15-scene dataset. Four multi-task methods and
the single-task baseline are compared.


4.2 15-Scenes 

We tested SKMTL in a multi-class classification scenario for visual scene
categorization, the 15-scenes dataset¹. The dataset contains images depicting natural or urban
scenes that have been organized in 15 distinct groups and the goal is to assign each
image to the correct scene category. It is natural to expect that categories will share similar
visual features. Our aim was to investigate whether these relations would be recovered
by the SKMTL method and prove beneficial to the actual classification process.

We represented images in the dataset with LLC coding [27], trained multi-class
classifiers on 50, 100 and 150 examples per class and tested them on 500 samples per
class. We repeated these classification experiments 20 times to account for statistical
variability.

In Table 1 we report the classification accuracy of the multi-class learning methods
tested: STL (baseline), Multi-task Feature Learning (MTFL) [?], MTRL, OKL and
our SKMTL. For all methods we used a linear kernel and least-squares loss as plug-in
classifier. Model selection was performed by five-fold cross-validation.

As can be noticed, SKMTL consistently outperforms all other methods. A
possible explanation for this behavior, similarly to the synthetic scenario, is that the
algorithm is actually recovering the most relevant relations among tasks and using this
information to improve prediction. In support of this interpretation, in Figure 3 we
report the relations recovered by SKMTL in graph form. An edge between two scene
categories $t$ and $s$ was drawn whenever the value of the corresponding entry $A_{ts}$ of
the recovered structure matrix was different from zero. Noticeably, SKMTL seems to
identify a clear group separation between natural and urban scenes. Furthermore, also
within these two main clusters, not all tasks are connected: for instance office scenes
are not related to scenes depicting the exterior of buildings, and mountain scenes are not
connected to images featuring mostly flat scenes such as highways or coastal regions.

¹ http://www-cvr.ai.uiuc.edu/ponce_grp/data/

Figure 3: Tasks structure graph recovered by the Sparse Kernel MTL (SKMTL)
proposed in this work on the 15-scenes dataset.

4.3 Animals with Attributes 

Animals with Attributes² (AwA) is a dataset designed to benchmark detection
algorithms in computer vision. The dataset comprises 50 different animal classes, each
annotated with 85 binary labels denoting the presence/absence of different attributes.
These attributes can be of different nature such as color (white, black, etc.), texture
(stripes, dots), type of limbs (hands, flippers, etc.), diet and so on. The standard
challenge is to perform attribute detection by training the system on a predefined set of 40
animal classes and testing on the remaining 10. In the following we will first discuss
the performance of multi-task approaches in this setting and then investigate how the
benefits of multi-task approaches can sometimes be dulled by the so-called “negative
transfer” and how our Sparse Kernel MTL method seems to be less sensitive to such an
issue. For the experiments described in the following we used the DECAF features [?]
recently made available on the Animals with Attributes website.

² http://attributes.kyb.tuebingen.mpg.de/

               AUC (%) per # tr. samples per class

               50               100              150
STL            57.26 ± 1.71     60.73 ± 1.12     64.37 ± 1.29
MTFL           58.11 ± 1.23     61.21 ± 1.14     64.22 ± 1.10
MTRL           58.24 ± 1.84     61.18 ± 1.23     64.56 ± 1.41
OKL            58.81 ± 1.18     62.07 ± 1.05     64.26 ± 1.18
SKMTL          58.63 ± 1.73     63.21 ± 1.43     64.51 ± 1.83

Table 2: Attribute detection results on the Animals with Attributes dataset.

4.3.1 Attribute Detection 

We considered the multi-task problem of attribute detection, which consists of 85 binary 
classification tasks. For each attribute, we randomly sampled 50, 100 and 150 
examples for training, 500 for validation and 500 for test. Results were averaged over 10 
trials. In Table 2 we report the Average Precision (area under the precision/recall curve) 
of the multi-task classifiers tested. As can be noticed, for all multi-task approaches the 
effect of sharing information across classifiers seems to have a noticeable impact when 
few training examples are available (the 50 or 100 columns in Table 2). As expected, 
such benefit decreases as the role of regularization becomes less crucial (150). 
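
As an illustration of the metric reported in Table 2, the following is a minimal sketch of average precision computed as a step-wise estimate of the area under the precision/recall curve. The function name `average_precision` and the toy labels and scores are hypothetical; the text does not specify which estimator or implementation was used in the experiments, so this should be read as one common choice rather than the one adopted here.

```python
import numpy as np

def average_precision(y_true, scores):
    """Mean of the precision values measured at the rank of each positive example."""
    y_true = np.asarray(y_true, dtype=float)
    scores = np.asarray(scores, dtype=float)
    order = np.argsort(-scores, kind="stable")     # rank examples by decreasing score
    y = y_true[order]
    true_positives = np.cumsum(y)
    precision_at_k = true_positives / np.arange(1, len(y) + 1)
    return float(np.sum(precision_at_k * y) / np.sum(y))

# Toy binary attribute: positives are ranked 1st and 3rd.
ap = average_precision([1, 0, 1, 0], [0.9, 0.8, 0.7, 0.1])
print(f"AP = {100 * ap:.2f}%")
```

In a protocol like the one described above, such a score would be computed per attribute on the test split and then averaged over tasks and over the random trials.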

4.3.2 Attribute Prediction - Color Vs Limb Shape 

Multi-task learning approaches rest on the assumption that tasks are strongly related 
to one another and that such structure can be exploited to improve overall prediction. 
When this assumption does not hold, or holds only partially (e.g. only some tasks have 
common structure), such methods can even be disadvantageous (“negative 
transfer” [?]). 

The AwA dataset offers the possibility to observe this effect since attributes are 
organized into multiple semantic groups [?]. We focused on a smaller setting by 
selecting only two groups of tasks, namely color and limb shape, and tested the effect 
of training multi-task methods jointly or independently across these two groups. For all 
the experiments we randomly sampled for each class 100 examples for training, 500 for 
validation and 500 for test, averaging the system performance over 10 trials. Table 3 
reports the average precision separately for the color and limb shape groups. 

Interestingly, methods relying on the assumption that all tasks share a common 
structure, such as MTFL, MTRL or OKL, experience a slight drop in performance on the 
color group when trained on all attribute detection tasks together (right columns) rather 
than separately (left columns). On the contrary, SKMTL remains stable since it correctly 
separates the two groups. 

                         Area under PR Curve (%) 
         Independent                     Joint 
         Color           Limb            Color           Limb 
STL      74.33 ± 0.81    68.13 ± 0.93    74.33 ± 0.81    68.15 ± 0.91 
MTFL     75.21 ± 0.73    69.41 ± 1.01    74.98 ± 1.18    69.71 ± 0.81 
MTRL     75.17 ± 0.53    69.18 ± 0.64    74.92 ± 0.78    69.73 ± 0.75 
OKL      74.52 ± 0.44    68.54 ± 0.61    74.31 ± 0.54    68.44 ± 0.22 
SKMTL    75.14 ± 0.97    69.21 ± 0.83    75.23 ± 0.77    69.57 ± 0.76 

Table 3: Attribute detection on two subsets of AwA. Comparison between methods 
trained independently or jointly on the two sets shows the effects of negative transfer. 

5 Conclusions 

We proposed a learning framework designed to solve multiple related tasks while 
simultaneously recovering their structure. We considered the setting of Reproducing 
Kernel Hilbert Spaces for vector-valued functions and formulated the Sparse Kernel 
MTL as an output kernel learning problem where both a multi-task predictor and 
a matrix encoding the task relations are inferred from empirical data. We imposed a 
sparsity penalty on the set of possible relations among tasks in order to recover only 
those that are more relevant to the learning problem. 

Adopting an alternating minimization strategy we were able to devise an optimization 
algorithm that provably converges to the global solution of the proposed learning 
problem. Empirical evaluation on both synthetic and real datasets confirmed the validity 
of the proposed model, which successfully recovered interpretable structures while at 
the same time comparing favorably with previous methods. 

Future research directions will focus mainly on modeling aspects: it will be 
interesting to investigate the possibility of combining our framework, which identifies sparse 
relations among the tasks, with recent multi-task linear models that take a different 
perspective and enforce task relations in the form of structured sparsity penalties on 
the feature space [?, 29]. 

6 Appendix 


6.1 On the (joint) convexity of Sparse Kernel MTL 

As stated in the paper, it can be shown that the Sparse Kernel MTL problem introduced 
in Eq. (?) is jointly convex in the two optimization variables $f$ and $A$. The proof of 
this fact in full generality requires the introduction of functional analysis tools that are 
beyond the scope of this work. However, according to Eq. (?) we have observed that it is 
possible to restrict the SKMTL problem to functions of the form 
$f(\cdot) = \sum_{i=1}^{n} k(\cdot, x_i) b_i$ with $b_i \in \mathbb{R}^T$. The following 
result proves the joint convexity of Eq. (?) for this setting. It 
is an extension of similar results in [?] and we give it here for completeness. 

Proposition 6.1. Let $V$ be a convex loss function. Then the functional in problem (?), 
restricted to functions $f$ of the form $f(\cdot) = \sum_{i=1}^{n} k(\cdot, x_i) b_i$ 
with $b_i \in \mathbb{R}^T$, is jointly convex in $f$ and $A$. 

Proof. Notice that the only term that requires some care is the component of the 
functional that mixes $f$ and $A$, namely $\|f\|_{\mathcal{H}}^2$ (where the dependency 
on $A$ is implicit in $\mathcal{H}$). Indeed, since $V$ is chosen to be convex, the 
empirical risk term is clearly convex in $f$ and does not depend on $A$, while all the 
remaining terms, i.e. $\operatorname{tr}(A^{-1})$, $\operatorname{tr}(A)$ and the 
sparsity-inducing norm of $A$, penalize only the structure matrix $A$ and are clearly 
convex with respect to it. 

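According to Eq. (?), for $f(\cdot) = \sum_{i=1}^{n} k(\cdot, x_i) b_i$ we have that 
$\|f\|_{\mathcal{H}}^2$ can be rewritten as 
$\|f\|_{\mathcal{H}}^2 = \operatorname{tr}(B^\top K B A^{-1})$, with $K \in S_+^n$ the 
empirical kernel matrix and $B \in \mathbb{R}^{n \times T}$ the matrix whose rows 
correspond to the $b_i$. Let us now set $b = \operatorname{vec}(B) \in \mathbb{R}^{nT}$, the 
vectorization of the matrix $B$ obtained by concatenating the columns of $B$. Then we have 
that 

$$\operatorname{tr}(B^\top K B A^{-1}) = b^\top (A^{-1} \otimes K)\, b. \tag{11}$$ 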
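
Identity (11) is easy to check numerically. The snippet below is a small sanity check on random matrices, not part of the proof; the sizes `n` and `T`, the random seed and the way `K` and `A` are generated are arbitrary choices made only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, T = 5, 3

# Random symmetric PSD kernel matrix K (n x n) and positive definite A (T x T).
X = rng.standard_normal((n, n))
K = X @ X.T
Y = rng.standard_normal((T, T))
A = Y @ Y.T + T * np.eye(T)
A_inv = np.linalg.inv(A)

B = rng.standard_normal((n, T))   # rows play the role of the coefficients b_i
b = B.flatten(order="F")          # vec(B): stack the columns of B

lhs = np.trace(B.T @ K @ B @ A_inv)     # tr(B^T K B A^{-1})
rhs = b @ np.kron(A_inv, K) @ b         # b^T (A^{-1} kron K) b
print(lhs, rhs)
```

Note that the column-wise stacking (`order="F"`) matters: with row-wise stacking the matching Kronecker product would be $K \otimes A^{-1}$ instead.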

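In order to show that the function $Q(A, b) = b^\top (A^{-1} \otimes K)\, b$ is jointly 
convex in $b$ and $A$, we will show that its epigraph is a convex set. To see this, notice that 

$$\begin{aligned} 
\operatorname{epi} Q &= \{(A, b, c) \in S_{++}^T \times \mathbb{R}^{nT} \times \mathbb{R} \mid c \ge b^\top (A^{-1} \otimes K)\, b\} \\ 
&= \left\{(A, b, c) \in S_{++}^T \times \mathbb{R}^{nT} \times \mathbb{R} \;\middle|\; \begin{pmatrix} A \otimes K^\dagger & b \\ b^\top & c \end{pmatrix} \in S_+^{nT+1} \right\} 
\end{aligned} \tag{12}$$ 

where the second equality is directly derived from a Schur complement argument. 
Consider now any pair of points $(A_1, b_1, c_1), (A_2, b_2, c_2) \in \operatorname{epi} Q$ 
and any $\theta \in [0, 1]$. We clearly have that the convex combination 

$$\theta \begin{pmatrix} A_1 \otimes K^\dagger & b_1 \\ b_1^\top & c_1 \end{pmatrix} + (1 - \theta) \begin{pmatrix} A_2 \otimes K^\dagger & b_2 \\ b_2^\top & c_2 \end{pmatrix} 
= \begin{pmatrix} \theta A_1 \otimes K^\dagger + (1 - \theta) A_2 \otimes K^\dagger & \theta b_1 + (1 - \theta) b_2 \\ \theta b_1^\top + (1 - \theta) b_2^\top & \theta c_1 + (1 - \theta) c_2 \end{pmatrix} \tag{13}$$ 

still belongs to $S_+^{nT+1}$, which implies that 

$$(\theta A_1 + (1 - \theta) A_2,\ \theta b_1 + (1 - \theta) b_2,\ \theta c_1 + (1 - \theta) c_2) \in \operatorname{epi} Q, \tag{14}$$ 

therefore proving that $Q$ is jointly convex in $b$ and $A$. 

□ 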
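
The two ingredients of the argument above can also be probed numerically. The following sketch checks the convexity inequality for $Q$ on random pairs of points and verifies that a point strictly inside the epigraph yields a positive semidefinite block matrix. It is a sanity check under simplifying assumptions, not a substitute for the proof: `K` is drawn positive definite (so that the pseudoinverse coincides with the inverse), and the helper names, sizes, seed and tolerances are hypothetical.

```python
import numpy as np

def random_pd(rng, d):
    """Random symmetric positive definite d x d matrix."""
    Y = rng.standard_normal((d, d))
    return Y @ Y.T + d * np.eye(d)

def Q(A, b, K):
    """Q(A, b) = b^T (A^{-1} kron K) b."""
    return float(b @ np.kron(np.linalg.inv(A), K) @ b)

rng = np.random.default_rng(1)
n, T = 4, 3
K = random_pd(rng, n)            # assumption: K invertible, so K^dagger = K^{-1}
K_pinv = np.linalg.pinv(K)

# 1) Convexity inequality on random pairs of points and random theta.
violations = 0
for _ in range(200):
    A1, A2 = random_pd(rng, T), random_pd(rng, T)
    b1, b2 = rng.standard_normal(n * T), rng.standard_normal(n * T)
    theta = rng.uniform()
    lhs = Q(theta * A1 + (1 - theta) * A2, theta * b1 + (1 - theta) * b2, K)
    rhs = theta * Q(A1, b1, K) + (1 - theta) * Q(A2, b2, K)
    if lhs > rhs + 1e-9 * (1.0 + abs(rhs)):
        violations += 1

# 2) Schur complement: a point with c > Q(A, b) gives a PSD block matrix.
A, b = random_pd(rng, T), rng.standard_normal(n * T)
c = Q(A, b, K) + 1.0
block = np.block([[np.kron(A, K_pinv), b[:, None]],
                  [b[None, :], np.array([[c]])]])
min_eig = np.linalg.eigvalsh(block).min()
print(violations, min_eig)
```

When `K` is singular the Schur complement characterization additionally involves a range condition on `b`, which this simplified check does not exercise.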


6.2 Cluster Multi-task Learning 

We briefly recall here the Convex Multi-task Cluster Learning proposed in [?] and 
show that it can be cast in the same framework as that of our Sparse Kernel MTL 
model. In particular, we comment on which choice of constraint set $\mathcal{A}$ can be 
imposed on the structure matrix $A$ to recover clustered structures of tasks. 

In the setting proposed by [?], tasks are assumed to belong to one of $r$ unknown 
clusters, with $r$ fixed a priori. While the original formulation is for the linear kernel, 
it can be easily extended to the non-linear setting of RKHSvv. Let 
$E \in \{0, 1\}^{T \times r}$ be the binary matrix whose entry $E_{st}$ has value 1 
whenever task $s$ belongs to cluster $t$, and 0 otherwise. Let $L$ be the normalized 
Laplacian of the graph defined by $E$. Set $M = I - L$ and 
$U = \frac{1}{T} \mathbf{1}\mathbf{1}^\top$. As we have observed in Eq. (?), the regularizer 
$\|f\|_{\mathcal{H}}^2$ depends on $A^{-1}$. The role of this term could be shaped to 
reflect the structure of the clusters encoded in the Laplacian $L$, hence in the matrix $M$. 
As noted in [?], $A^{-1}(M)$ can be chosen so that: 

$$A^{-1}(M) = \epsilon_M U + \epsilon_B (M - U) + \epsilon_W (I - M), \tag{15}$$ 

where the first term is a global penalty on the average predictor, the second term 
penalizes the between-cluster variance, and the third term penalizes the within-cluster 
variance. Since $M$ belongs to a discrete set, the authors propose a relaxation for $M$ by 
constraining it to be in a convex set 
$S_c = \{M \in S_+^T,\ 0 \preceq M \preceq I,\ \operatorname{tr}(M) = r\}$, which 
directly induces a set $\mathcal{A}$ of spectral constraints for $A$. 
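
To make Eq. (15) concrete, the sketch below builds $A^{-1}(M)$ for a toy cluster assignment. It assumes that $M = I - L$ coincides with the orthogonal projection onto the span of the cluster indicator vectors, $M = E (E^\top E)^{-1} E^\top$, which is how we read the definition above; the function name `cluster_structure_inverse`, the assignment matrix and the values of the three weights are hypothetical.

```python
import numpy as np

def cluster_structure_inverse(E, eps_M, eps_B, eps_W):
    """Build A^{-1}(M) = eps_M U + eps_B (M - U) + eps_W (I - M) from assignments E."""
    E = np.asarray(E, dtype=float)
    T = E.shape[0]
    # Assumption: M is the projection onto the cluster indicators (no empty cluster).
    M = E @ np.linalg.inv(E.T @ E) @ E.T
    U = np.ones((T, T)) / T                 # projection onto the constant vector
    I = np.eye(T)
    A_inv = eps_M * U + eps_B * (M - U) + eps_W * (I - M)
    return A_inv, M, U

# Toy example: T = 5 tasks, r = 2 clusters (tasks 0-2 and tasks 3-4).
E = np.array([[1, 0], [1, 0], [1, 0], [0, 1], [0, 1]])
eps_M, eps_B, eps_W = 0.5, 1.0, 4.0         # hypothetical weights
A_inv, M, U = cluster_structure_inverse(E, eps_M, eps_B, eps_W)

# U, M - U and I - M are mutually orthogonal projections summing to the identity,
# so A is obtained by inverting the three weights separately.
A = U / eps_M + (M - U) / eps_B + (np.eye(5) - M) / eps_W
print(np.round(A, 3))
```

Under this reading, the spectrum of $A^{-1}(M)$ consists of $\epsilon_M$ (once), $\epsilon_B$ ($r - 1$ times) and $\epsilon_W$ ($T - r$ times), which is one way to see why constraints on $M$ translate into spectral constraints on $A$.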




